Selecting a Two-Group Classification Weighting Algorithm: Take Two

Authors

  • John D. Morris
  • Mary G. Lieberman
Abstract

The two-group cross-validation classification accuracies of six algorithms (i.e., least squares, ridge regression, principal components, a common factor method, equal weighting, and logistic regression) were compared as a function of degree of validity concentration, group separation, and number of subjects. The findings of two previous studies were thereby extended to the latter three methods, with particular interest in how logistic regression fared as a function of validity concentration. With respect to validity concentration, as well as group separation and N, logistic regression mirrored least squares: the same relative decrease in accuracy with increasing validity concentration, compared with the alternate methods, previously evidenced with least squares was observed. However, the large number of samples in which logistic regression failed to yield a solution may be a cause for concern.

This investigation extends Morris and Huberty (1987) by contrasting the accuracy of six algorithms for classifying subjects into one of two groups. Ordinary Least Squares (OLS), Ridge Regression (Ridge), Principal Components (PC), Pruzek and Frederick's (1978) common factor method (Pruzek), Equal Weighting (Equal), and Logistic Regression (LR) are the techniques that were compared. Darlington (1978) posited that regression cross-validation accuracy is a function of R², N, and validity concentration, where R² represents the squared multiple correlation and N is the sample size. In Darlington's formulation, validity concentration describes a data condition in which the principal components of the predictors with large eigenvalues also have large correlations with the criterion. Thus, validity concentration requires at least a modicum of predictor-variable collinearity (i.e., large predictor eigenvalues), but collinearity is necessary, not sufficient, for validity concentration.
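Darlington's notion of validity concentration can be made concrete with a small numerical sketch. The data below are hypothetical (simulated with NumPy only, not from the study): two nearly collinear predictors span a large-eigenvalue component, and the criterion is built mainly from that shared dimension, so the component with the largest eigenvalue also carries the largest validity.

```python
# A minimal sketch of checking for validity concentration in Darlington's
# (1978) sense: do the predictor principal components with the largest
# eigenvalues also correlate most strongly with the criterion?
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two nearly collinear predictors plus one independent predictor.
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)          # strongly collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# Criterion built mostly from the shared (large-eigenvalue) dimension,
# so validity is concentrated on the first principal component.
y = x1 + x2 + 0.2 * rng.normal(size=n)

# Eigen-decomposition of the predictor correlation matrix.
R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]           # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Component scores and their correlations (validities) with the criterion.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
scores = Z @ eigvecs
validities = np.array([np.corrcoef(scores[:, j], y)[0, 1] for j in range(3)])

for lam, v in zip(eigvals, validities):
    print(f"eigenvalue = {lam:5.2f}   |r with criterion| = {abs(v):.2f}")
```

In this contrived setup the first component (eigenvalue near 2) carries nearly all of the validity, while the small-eigenvalue component carries essentially none; data with large eigenvalues but validities spread evenly across components would be collinear without being validity-concentrated.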
Darlington therefore suggested that ridge regression may be the most useful statistical technique for practical prediction problems. Through simulation, Morris (1982) re-examined the performance of Ridge with the same data structures on which Darlington posited the technique's superiority. With large validity concentration, Ridge was, indeed, more accurate than OLS, but contrary to Darlington's suggestions, Ridge was never the most accurate prediction technique. That is, in each case in which Ridge surpassed OLS due to large validity concentration, other contending weighting methods surpassed Ridge. Morris and Huberty (1987) examined a subset of the methods (i.e., OLS, Ridge, and PC) considered in Darlington (1978) and Morris (1982) in the context of two-group classification accuracy, rather than regression, using the same simulated data conditions but extending them to three different population model accuracies and two different sample sizes; unlike Morris (1982), the Pruzek and Equal methods were not included. The present study uses the same data conditions as Morris and Huberty (1987) but includes the Pruzek and Equal methods. In addition, Logistic Regression (LR), an increasingly popular classification technique, was included. The difficulties with multicollinearity that are well known for the OLS algorithm have also been noted (Hosmer & Lemeshow, 2000) as theoretically problematic for LR, and ridge techniques have been suggested as a solution (Schaefer, 1986). However, the degree to which multicollinearity, and the validity concentration that can result from it, affects the classification accuracy of LR has not been empirically and systematically examined.
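The numerical trouble that multicollinearity causes for maximum-likelihood LR, and the ridge remedy along the lines Schaefer (1986) suggested, can be sketched with simulated data. The IRLS routine and data below are illustrative assumptions, not the study's actual procedure:

```python
# A minimal sketch of why near-collinear predictors destabilize logistic
# regression: the Newton/IRLS step solves a system with the Hessian X'WX,
# which becomes ill-conditioned, while an L2 (ridge) penalty restores
# conditioning and yields modest, finite weights.
import numpy as np

def logistic_irls(X, y, ridge=0.0, iters=25):
    """Newton-Raphson (IRLS) for logistic regression; ridge > 0 adds an
    L2 penalty: lambda*I on the Hessian, -lambda*beta on the gradient."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)                      # IRLS weights
        grad = X.T @ (y - p) - ridge * beta
        hess = (X.T * W) @ X + ridge * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)            # almost perfectly collinear
X = np.column_stack([np.ones(n), x1, x2])
y = (x1 + rng.normal(scale=0.5, size=n) > 0).astype(float)

# Condition number of the Hessian at beta = 0 (all IRLS weights = 0.25):
# enormous without a penalty, tame with ridge = 1.
W0 = np.full(n, 0.25)
H = (X.T * W0) @ X
print("cond(H), no penalty :", np.linalg.cond(H))
print("cond(H), ridge = 1  :", np.linalg.cond(H + np.eye(3)))

b_ridge = logistic_irls(X, y, ridge=1.0)
acc = np.mean((X @ b_ridge > 0) == (y == 1))
print("ridge coefficients  :", np.round(b_ridge, 2))
print("training accuracy   :", acc)
```

The penalized fit spreads a stable, moderate weight across the two collinear predictors, whereas the unpenalized Newton step must invert a matrix whose condition number is on the order of 10¹², which is the mechanism behind the failed LR solutions discussed above.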



Journal:

Volume   Issue

Pages  -

Publication date: 2012